Introduction

This markdown document briefly presents the revised results for MitoImpute. We noticed that HaploGrep2 is able to capture more haplogroups than HiMC, the haplogroup assignment method we were previously using. Therefore, we have generated results that display both the old HiMC outputs and the new HaploGrep outputs. Additionally, I have included string distances relative to the ‘truth’ haplogroupings (assigned from the multiple sequence alignment), as well as HaploGrep quality scores.

Minor allele frequency experiments

This section will detail the minor allele frequency experiments.

## Rows: 387
## Columns: 71
## $ array                                           <fct> BDCHP-1X10-HUMANHAP24…
## $ mcmc                                            <chr> "MCMC1", "MCMC1", "MC…
## $ refpan_maf                                      <ord> MAF1%, MAF1%, MAF1%, …
## $ k_hap                                           <ord> kHAP500, kHAP500, kHA…
## $ imputed                                         <lgl> TRUE, FALSE, FALSE, F…
## $ info_cutoff                                     <dbl> 0.3, NA, NA, NA, NA, …
## $ n_snps_array                                    <dbl> 309, NA, NA, NA, NA, …
## $ n_snps_imputed                                  <dbl> 483, NA, NA, NA, NA, …
## $ n_snps_cutoff_imputed                           <dbl> 467, NA, NA, NA, NA, …
## $ n_type_0                                        <dbl> 181, NA, NA, NA, NA, …
## $ n_type_1                                        <dbl> 0, NA, NA, NA, NA, 0,…
## $ n_type_2                                        <dbl> 229, NA, NA, NA, NA, …
## $ n_type_3                                        <dbl> 73, NA, NA, NA, NA, 4…
## $ n_type_0_cutoff                                 <dbl> 165, NA, NA, NA, NA, …
## $ n_type_1_cutoff                                 <dbl> 0, NA, NA, NA, NA, 0,…
## $ n_type_2_cutoff                                 <dbl> 229, NA, NA, NA, NA, …
## $ n_type_3_cutoff                                 <dbl> 73, NA, NA, NA, NA, 4…
## $ mean_info                                       <dbl> 0.8791739, NA, NA, NA…
## $ mean_info_cutoff                                <dbl> 0.9037966, NA, NA, NA…
## $ mean_maf                                        <dbl> 0.06190269, NA, NA, N…
## $ mean_maf_cutoff                                 <dbl> 0.06381799, NA, NA, N…
## $ mean_mcc                                        <dbl> 0.8179815, NA, NA, NA…
## $ mean_mcc_cutoff                                 <dbl> 0.8727745, NA, NA, NA…
## $ mean_concordance                                <dbl> 0.9958531, NA, NA, NA…
## $ mean_concordance_cutoff                         <dbl> 0.9959055, NA, NA, NA…
## $ mean_certainty                                  <dbl> 0.9973703, NA, NA, NA…
## $ mean_certainty_cutoff                           <dbl> 0.9974721, NA, NA, NA…
## $ mean_himc_concordance_typed                     <dbl> 0.9806553, NA, NA, NA…
## $ mean_himc_concordance_typed_macro               <dbl> 0.9936834, NA, NA, NA…
## $ mean_himc_concordance_imputed                   <dbl> 0.9885511, NA, NA, NA…
## $ mean_himc_concordance_imputed_cutoff            <dbl> 0.9885511, NA, NA, NA…
## $ mean_himc_concordance_imputed_macro             <dbl> 0.9984208, NA, NA, NA…
## $ mean_himc_concordance_imputed_macro_cutoff      <dbl> 0.9984208, NA, NA, NA…
## $ mean_haplogrep_concordance_typed                <dbl> 0.3062352, NA, NA, NA…
## $ mean_haplogrep_concordance_typed_macro          <dbl> 0.9932912, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed              <dbl> 0.2841358, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_cutoff       <dbl> 0.2892660, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_macro        <dbl> 0.9940805, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_macro_cutoff <dbl> 0.9940805, NA, NA, NA…
## $ mean_haplogrep_quality_truth                    <dbl> 0.8560609, NA, NA, NA…
## $ mean_haplogrep_quality_typed                    <dbl> 0.9822484, NA, NA, NA…
## $ mean_haplogrep_quality_imputed                  <dbl> 0.9785349, NA, NA, NA…
## $ mean_haplogrep_quality_imputed_cutoff           <dbl> 0.9789348, NA, NA, NA…
## $ mean_haplogrep_distance_dl_typed                <dbl> 1.865430, NA, NA, NA,…
## $ mean_haplogrep_distance_dl_imputed              <dbl> 2.160616, NA, NA, NA,…
## $ mean_haplogrep_distance_dl_imputed_cutoff       <dbl> 2.123125, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_typed                <dbl> 1.865430, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_imputed              <dbl> 2.160616, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_imputed_cutoff       <dbl> 2.123125, NA, NA, NA,…
## $ mean_haplogrep_distance_jc_typed                <dbl> 0.2800019, NA, NA, NA…
## $ mean_haplogrep_distance_jc_imputed              <dbl> 0.3019733, NA, NA, NA…
## $ mean_haplogrep_distance_jc_imputed_cutoff       <dbl> 0.2982575, NA, NA, NA…
## $ himc_diff                                       <dbl> 0.007895776, NA, NA, …
## $ himc_cutoff_diff                                <dbl> 0.007895776, NA, NA, …
## $ himc_macro_diff                                 <dbl> 0.004737465, NA, NA, …
## $ himc_macro_cutoff_diff                          <dbl> 0.004737465, NA, NA, …
## $ haplogrep_diff                                  <dbl> -0.022099448, NA, NA,…
## $ haplogrep_cutoff_diff                           <dbl> -0.016969219, NA, NA,…
## $ haplogrep_macro_diff                            <dbl> 0.000789266, NA, NA, …
## $ haplogrep_macro_cutoff_diff                     <dbl> 0.000789266, NA, NA, …
## $ haplogrep_quality_diff                          <dbl> -0.003713536, NA, NA,…
## $ haplogrep_quality_cutoff_diff                   <dbl> -0.003313575, NA, NA,…
## $ haplogrep_quality_diff_truth_typed              <dbl> -0.1261875, NA, NA, N…
## $ haplogrep_quality_diff_truth_imputed            <dbl> -0.1224739, NA, NA, N…
## $ haplogrep_quality_diff_truth_imputed_cutoff     <dbl> -0.1228739, NA, NA, N…
## $ haplogrep_distance_dl_diff                      <dbl> 0.2951855, NA, NA, NA…
## $ haplogrep_distance_dl_cutoff_diff               <dbl> 0.2576953, NA, NA, NA…
## $ haplogrep_distance_lv_diff                      <dbl> 0.2951855, NA, NA, NA…
## $ haplogrep_distance_lv_cutoff_diff               <dbl> 0.2576953, NA, NA, NA…
## $ haplogrep_distance_jc_diff                      <dbl> 0.0219713276, NA, NA,…
## $ haplogrep_distance_jc_cutoff_diff               <dbl> 0.018255556, NA, NA, …
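
The `*_diff` columns appear to be simple differences between the imputed and genotyped ("typed") summaries. That definition is an inference from the first row printed above, not something stated elsewhere in this document. The sketch below checks it using only values copied from that row.

```r
# Values copied from the first row of the summary printed above.
row1 <- list(
  mean_himc_concordance_typed               = 0.9806553,
  mean_himc_concordance_imputed             = 0.9885511,
  himc_diff                                 = 0.007895776,
  mean_haplogrep_concordance_typed          = 0.3062352,
  mean_haplogrep_concordance_imputed        = 0.2841358,
  mean_haplogrep_concordance_imputed_cutoff = 0.2892660,
  haplogrep_diff                            = -0.022099448,
  haplogrep_cutoff_diff                     = -0.016969219
)

# Assumed definition: diff = imputed - typed (positive means imputation helped).
himc_check <- row1$mean_himc_concordance_imputed -
  row1$mean_himc_concordance_typed
haplogrep_check <- row1$mean_haplogrep_concordance_imputed -
  row1$mean_haplogrep_concordance_typed
haplogrep_cutoff_check <- row1$mean_haplogrep_concordance_imputed_cutoff -
  row1$mean_haplogrep_concordance_typed

# The printed values are rounded, so compare with a small tolerance.
tol <- 1e-6
checks <- c(
  himc             = abs(himc_check - row1$himc_diff) < tol,
  haplogrep        = abs(haplogrep_check - row1$haplogrep_diff) < tol,
  haplogrep_cutoff = abs(haplogrep_cutoff_check - row1$haplogrep_cutoff_diff) < tol
)
print(checks)
```

If the assumed definition holds, a negative `haplogrep_diff` means the imputed data matched the truth set slightly less often than the genotyped data for that row.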

HiMC

HiMC Haplogrouping

We previously found that imputing missing variants increased the accuracy of haplogroup assignments when using HiMC to assign haplogroups.
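
As a rough illustration of what is being compared in the plots that follow, the sketch below computes concordance as the proportion of samples whose assigned haplogroup exactly matches the truth set. Both the exact-match definition and the haplogroup labels are assumptions made for illustration; they are not taken from the MitoImpute pipeline.

```r
# Hypothetical haplogroup calls for five samples (illustrative only).
truth   <- c("H1a1", "U5b2", "J1c", "L3e1", "T2b")
typed   <- c("H1a",  "U5b2", "J1",  "L3e1", "T2")
imputed <- c("H1a1", "U5b2", "J1c", "L3e",  "T2")

# Assumed definition: proportion of samples with an exact match to the truth set.
concordance <- function(truth, called) mean(truth == called)

conc_typed   <- concordance(truth, typed)
conc_imputed <- concordance(truth, imputed)

# Positive values mean the imputed calls agree with the truth set more often.
conc_diff <- conc_imputed - conc_typed
print(c(typed = conc_typed, imputed = conc_imputed, diff = conc_diff))
```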

Boxplot of mean haplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HiMC


Compare this result with the imputed data, which shows a higher haplogroup concordance:
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HiMC. No filter was applied to remove imputed sites with info ≤ 0.3.

If the improvement in accurate assignment of haplogroups wasn’t evident from the last two plots, displaying the mean difference should make this clear:
Boxplot of the mean difference in haplogroup concordance between the genotyped and imputed data, each measured relative to the truth set. Haplogroups were assigned using HiMC. No filter was applied to remove imputed sites with info ≤ 0.3.

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed haplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0029833 | 0.0014916 | 0.0441965 | 0.9567721 |
| Residuals | 304 | 10.2600244 | 0.0337501 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.8600744 | 0.0182800 | 304 | 0.8241030 | 0.8960458 |
| MAF0.5% | 0.8551735 | 0.0181017 | 304 | 0.8195530 | 0.8907939 |
| MAF0.1% | 0.8525313 | 0.0181017 | 304 | 0.8169108 | 0.8881517 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | 0.0049010 | 0.0257261 | 304 | 0.1905065 | 0.9801924 |
| MAF1% - MAF0.1% | 0.0075432 | 0.0257261 | 304 | 0.2932113 | 0.9537218 |
| MAF0.5% - MAF0.1% | 0.0026422 | 0.0255996 | 304 | 0.1032120 | 0.9941443 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0054205 | 0.0027102 | 0.1366648 | 0.8723168 |
| Residuals | 300 | 5.9493818 | 0.0198313 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.3103129 | 0.0140125 | 300 | 0.2827377 | 0.3378881 |
| MAF0.5% | 0.3203330 | 0.0140125 | 300 | 0.2927579 | 0.3479082 |
| MAF0.1% | 0.3176033 | 0.0140125 | 300 | 0.2900282 | 0.3451785 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0100201 | 0.0198166 | 300 | -0.5056419 | 0.8686475 |
| MAF1% - MAF0.1% | -0.0072904 | 0.0198166 | 300 | -0.3678945 | 0.9281329 |
| MAF0.5% - MAF0.1% | 0.0027297 | 0.0198166 | 300 | 0.1377474 | 0.9895942 |

HiMC Macrohaplogrouping

This trend can be further seen when only macrohaplogroups are considered:
Boxplot of mean macrohaplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HiMC.

Compare this result with the imputed data, which shows a higher macrohaplogroup concordance:
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HiMC. No filter was applied to remove imputed sites with info ≤ 0.3.

If the improvement in accurate assignment of haplogroups wasn’t evident from the last two plots, displaying the mean difference should make this clear:
Boxplot of the mean difference in macrohaplogroup concordance between the genotyped and imputed data, each measured relative to the truth set. Haplogroups were assigned using HiMC. No filter was applied to remove imputed sites with info ≤ 0.3.

These can be statistically tested with linear models:

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0059374 | 0.0029687 | 0.0926436 | 0.911544 |
| Residuals | 304 | 9.7415101 | 0.0320444 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.8938627 | 0.0178121 | 304 | 0.8588120 | 0.9289133 |
| MAF0.5% | 0.8883512 | 0.0176383 | 304 | 0.8536425 | 0.9230599 |
| MAF0.1% | 0.8830726 | 0.0176383 | 304 | 0.8483639 | 0.9177813 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | 0.0055115 | 0.0250676 | 304 | 0.2198638 | 0.9737057 |
| MAF1% - MAF0.1% | 0.0107901 | 0.0250676 | 304 | 0.4304400 | 0.9029618 |
| MAF0.5% - MAF0.1% | 0.0052786 | 0.0249444 | 304 | 0.2116161 | 0.9756173 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0043381 | 0.0021691 | 0.0892709 | 0.9146221 |
| Residuals | 300 | 7.2892487 | 0.0242975 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.2363714 | 0.0155103 | 300 | 0.2058486 | 0.2668942 |
| MAF0.5% | 0.2455937 | 0.0155103 | 300 | 0.2150710 | 0.2761165 |
| MAF0.1% | 0.2401832 | 0.0155103 | 300 | 0.2096605 | 0.2707060 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0092223 | 0.0219349 | 300 | -0.4204415 | 0.9072014 |
| MAF1% - MAF0.1% | -0.0038118 | 0.0219349 | 300 | -0.1737785 | 0.9834902 |
| MAF0.5% - MAF0.1% | 0.0054105 | 0.0219349 | 300 | 0.2466630 | 0.9670199 |

These results suggest that there is no statistically significant difference in accurate assignment of haplogroups or macrohaplogroups between different Reference Panel minor allele frequency filtering thresholds.
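
For readers who want to reproduce this kind of comparison, the sketch below fits a one-way linear model of concordance on `refpan_maf` and extracts an ANOVA table and Tukey-adjusted pairwise contrasts. The data are simulated and are not the study results. The group sizes (101/103/103) are an assumption, chosen so that the residual degrees of freedom equal 304 as in the tables above. The contrast tables in this document look like output from the emmeans package; base R's `TukeyHSD` is swapped in here so the example runs without extra packages.

```r
set.seed(1)

# Simulated stand-in for the per-array summary table (NOT the study data).
maf_levels <- c("MAF1%", "MAF0.5%", "MAF0.1%")
sim <- data.frame(
  refpan_maf = factor(rep(maf_levels, times = c(101, 103, 103)),
                      levels = maf_levels),
  concordance = rnorm(307, mean = 0.86, sd = 0.18)
)

# One-way model: does mean concordance differ between MAF thresholds?
fit <- aov(concordance ~ refpan_maf, data = sim)

# ANOVA table with Df, Sum Sq, Mean Sq, F value and Pr(>F).
anova_tab <- summary(fit)[[1]]

# Per-threshold means (for a one-way model these are the group means).
group_means <- tapply(sim$concordance, sim$refpan_maf, mean)

# Tukey-adjusted pairwise differences between thresholds.
tukey <- TukeyHSD(fit, "refpan_maf")$refpan_maf

print(anova_tab)
print(group_means)
print(tukey)
```

Note that `TukeyHSD` reports each contrast as the later factor level minus the earlier one, so the signs may be flipped relative to the contrast tables above.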

HaploGrep 2.0

HaploGrep Haplogrouping

We are investigating the use of HaploGrep 2.0 for assigning haplogroups, as it is better able to assign haplogroups that cover all sub-groupings.

Boxplot of mean haplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep


Compare this result with the imputed data:
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. No filter was applied to remove imputed sites with info ≤ 0.3.

Displaying the mean difference between the genotyped and imputed data makes the comparison clearer:
Boxplot of the mean difference in haplogroup concordance between the genotyped and imputed data, each measured relative to the truth set. Haplogroups were assigned using HaploGrep. No filter was applied to remove imputed sites with info ≤ 0.3.

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed haplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0976431 | 0.0488216 | 4.789485 | 0.0089547 |
| Residuals | 304 | 3.0988201 | 0.0101935 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.1672931 | 0.0100462 | 304 | 0.1475243 | 0.1870620 |
| MAF0.5% | 0.1872476 | 0.0099482 | 304 | 0.1676716 | 0.2068236 |
| MAF0.1% | 0.2109831 | 0.0099482 | 304 | 0.1914071 | 0.2305590 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0199545 | 0.0141383 | 304 | -1.411377 | 0.3362773 |
| MAF1% - MAF0.1% | -0.0436899 | 0.0141383 | 304 | -3.090183 | 0.0061722 |
| MAF0.5% - MAF0.1% | -0.0237355 | 0.0140688 | 304 | -1.687096 | 0.2117763 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.1160262 | 0.0580131 | 163.6211 | 0 |
| Residuals | 304 | 0.1077856 | 0.0003546 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | -0.0395844 | 0.0018736 | 304 | -0.0432713 | -0.0358975 |
| MAF0.5% | -0.0156206 | 0.0018553 | 304 | -0.0192715 | -0.0119696 |
| MAF0.1% | 0.0081149 | 0.0018553 | 304 | 0.0044639 | 0.0117658 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0239639 | 0.0026368 | 304 | -9.088190 | 0 |
| MAF1% - MAF0.1% | -0.0476993 | 0.0026368 | 304 | -18.089758 | 0 |
| MAF0.5% - MAF0.1% | -0.0237355 | 0.0026239 | 304 | -9.046021 | 0 |

HaploGrep Macrohaplogrouping

This trend can be further seen when only macrohaplogroups are considered:
Boxplot of mean macrohaplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep.

Compare this result with the imputed data, which shows a higher macrohaplogroup concordance:
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. No filter was applied to remove imputed sites with info ≤ 0.3.

If the improvement in accurate assignment of haplogroups wasn’t evident from the last two plots, displaying the mean difference should make this clear:
Boxplot of the mean difference in macrohaplogroup concordance between the genotyped and imputed data, each measured relative to the truth set. Haplogroups were assigned using HaploGrep. No filter was applied to remove imputed sites with info ≤ 0.3.

These can be statistically tested with linear models:

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0012794 | 0.0006397 | 0.019601 | 0.9805911 |
| Residuals | 304 | 9.9213982 | 0.0326362 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.8842553 | 0.0179758 | 304 | 0.8488825 | 0.9196281 |
| MAF0.5% | 0.8792615 | 0.0178005 | 304 | 0.8442338 | 0.9142892 |
| MAF0.1% | 0.8813994 | 0.0178005 | 304 | 0.8463717 | 0.9164271 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | 0.0049939 | 0.0252980 | 304 | 0.1974015 | 0.9787485 |
| MAF1% - MAF0.1% | 0.0028559 | 0.0252980 | 304 | 0.1128921 | 0.9929985 |
| MAF0.5% - MAF0.1% | -0.0021379 | 0.0251736 | 304 | -0.0849267 | 0.9960315 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0085920 | 0.0042960 | 2.340219 | 0.0980394 |
| Residuals | 304 | 0.5580574 | 0.0018357 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.0021255 | 0.0042633 | 304 | -0.0062637 | 0.0105148 |
| MAF0.5% | 0.0121608 | 0.0042217 | 304 | 0.0038534 | 0.0204682 |
| MAF0.1% | 0.0142987 | 0.0042217 | 304 | 0.0059914 | 0.0226061 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0100353 | 0.0059998 | 304 | -1.6725957 | 0.2174178 |
| MAF1% - MAF0.1% | -0.0121732 | 0.0059998 | 304 | -2.0289253 | 0.1070455 |
| MAF0.5% - MAF0.1% | -0.0021379 | 0.0059703 | 304 | -0.3580893 | 0.9317786 |

HaploGrep Haplogrouping (with info > 0.3 cutoff)

It should be noted that, by convention, imputed variants with an IMPUTE2 info score ≤ 0.3 are excluded from the final datasets. As such, I have also displayed these results after excluding any imputed sites with an info score ≤ 0.3.
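
A minimal sketch of this filtering step is shown below. The data frame, its column names and its values are hypothetical, loosely modelled on a per-site info table; they are not the MitoImpute pipeline's actual objects. In the first row of the summary printed earlier, `n_snps_imputed` drops from 483 to 467 after the cutoff and `n_type_0` drops from 181 to 165, while the type 2 and type 3 counts are unchanged, which is consistent with only type 0 sites being removed.

```r
# Hypothetical per-site table (illustrative values only).
# `type` 0 is assumed here to denote imputed-only sites and 2 genotyped sites.
sites <- data.frame(
  position = c(73, 150, 263, 750, 1438, 16519),
  info     = c(0.95, 0.30, 0.12, 1.00, 0.31, 0.88),
  type     = c(0, 0, 0, 2, 0, 2)
)

# Keep sites strictly above the cutoff, so info <= 0.3 is excluded.
info_cutoff <- 0.3
kept <- sites[sites$info > info_cutoff, ]

n_snps_imputed        <- nrow(sites)
n_snps_cutoff_imputed <- nrow(kept)
print(c(before = n_snps_imputed, after = n_snps_cutoff_imputed))
```

A site with info exactly equal to 0.3 is dropped under this rule, matching the "info ≤ 0.3 are excluded" wording.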

Imputed haplogroup concordance (info > 0.3 cutoff applied):
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

Difference in haplogroup concordance between the genotyped and imputed datasets (info > 0.3 cutoff applied):
Boxplot of the mean difference in haplogroup concordance between the genotyped and imputed data, each measured relative to the truth set. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed haplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0254557 | 0.0127278 | 1.252717 | 0.2872478 |
| Residuals | 294 | 2.9870905 | 0.0101602 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.1822736 | 0.0100297 | 294 | 0.1625344 | 0.2020127 |
| MAF0.5% | 0.1972935 | 0.0099319 | 294 | 0.1777469 | 0.2168401 |
| MAF0.1% | 0.2046356 | 0.0104522 | 294 | 0.1840649 | 0.2252062 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0150200 | 0.0141152 | 294 | -1.0640995 | 0.5371477 |
| MAF1% - MAF0.1% | -0.0223620 | 0.0144860 | 294 | -1.5436953 | 0.2720877 |
| MAF0.5% - MAF0.1% | -0.0073421 | 0.0144184 | 294 | -0.5092128 | 0.8669185 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0739095 | 0.0369548 | 129.5151 | 0 |
| Residuals | 294 | 0.0838876 | 0.0002853 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | -0.0246040 | 0.0016808 | 294 | -0.0279119 | -0.0212961 |
| MAF0.5% | -0.0055747 | 0.0016644 | 294 | -0.0088503 | -0.0022990 |
| MAF0.1% | 0.0144649 | 0.0017516 | 294 | 0.0110176 | 0.0179122 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0190293 | 0.0023654 | 294 | -8.044750 | 0 |
| MAF1% - MAF0.1% | -0.0390689 | 0.0024276 | 294 | -16.093751 | 0 |
| MAF0.5% - MAF0.1% | -0.0200396 | 0.0024163 | 294 | -8.293641 | 0 |

HaploGrep Macrohaplogrouping (with info > 0.3 cutoff)

This trend can be further seen when only macrohaplogroups are considered.

Compare the genotyped result above with the imputed data, which shows a higher macrohaplogroup concordance:
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

If the improvement in accurate assignment of haplogroups wasn’t evident from the last two plots, displaying the mean difference should make this clear:
Boxplot of the mean difference in macrohaplogroup concordance between the genotyped and imputed data, each measured relative to the truth set. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

These can be statistically tested with linear models:

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0063379 | 0.0031690 | 0.1089308 | 0.8968286 |
| Residuals | 294 | 8.5529105 | 0.0290915 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.8900459 | 0.0169716 | 294 | 0.8566447 | 0.9234471 |
| MAF0.5% | 0.8839626 | 0.0168060 | 294 | 0.8508872 | 0.9170379 |
| MAF0.1% | 0.8786272 | 0.0176865 | 294 | 0.8438190 | 0.9134354 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | 0.0060833 | 0.0238847 | 294 | 0.2546947 | 0.9648766 |
| MAF1% - MAF0.1% | 0.0114186 | 0.0245122 | 294 | 0.4658356 | 0.8873319 |
| MAF0.5% - MAF0.1% | 0.0053354 | 0.0243978 | 294 | 0.2186813 | 0.9739841 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0098631 | 0.0049316 | 2.410029 | 0.0915852 |
| Residuals | 294 | 0.6016025 | 0.0020463 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.0079161 | 0.0045011 | 294 | -0.0009424 | 0.0167746 |
| MAF0.5% | 0.0168619 | 0.0044572 | 294 | 0.0080899 | 0.0256340 |
| MAF0.1% | 0.0219469 | 0.0046907 | 294 | 0.0127152 | 0.0311785 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0089458 | 0.0063346 | 294 | -1.4122253 | 0.3358850 |
| MAF1% - MAF0.1% | -0.0140308 | 0.0065010 | 294 | -2.1582526 | 0.0802339 |
| MAF0.5% - MAF0.1% | -0.0050850 | 0.0064707 | 294 | -0.7858469 | 0.7120262 |

These results suggest that there is a statistically significant difference in accurate assignment of haplogroups between different Reference Panel minor allele frequency filtering thresholds. However, the effect is small (the largest estimated contrast is about 0.05 in concordance), so its biological and practical significance seems limited.

These results suggest that there is no statistically significant difference in accurate assignment of macrohaplogroups between different Reference Panel minor allele frequency filtering thresholds. However, it should be noted that both the genotyped and imputed datasets allow HaploGrep to accurately call macrohaplogroups, with average accuracy in the high 80% range.

There is a slight increase in the ability to accurately call haplogroups when a filter of info > 0.3 is applied, but the biological and practical significance of the improvement again seems small.

HaploGrep haplogroup quality comparisons

We also examined the difference in HaploGrep’s quality score between the truth set, genotyped set, and imputed set.

Here I show the difference between the truth set and the genotyped set:
Boxplot of the difference in mean HaploGrep quality score between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep.

Here I show the difference between the truth set and the imputed set:
Boxplot of the difference in mean HaploGrep quality score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. No filter was applied to remove imputed sites with info ≤ 0.3.

Here I show the difference between the truth set and the imputed set with the info score filter info > 0.3:
Boxplot of the difference in mean HaploGrep quality score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

Here it appears that relative to the truth set, the quality is still decreased.

However, I have also investigated the difference between the genotyped and imputed datasets to see if there is any improvement. I have only investigated the imputed dataset filtered with info > 0.3.
Boxplot of the difference in mean HaploGrep quality score between the genotyped data and the imputed data. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

On average, there is a decrease in HaploGrep quality score.

HaploGrep string distance (Damerau-Levenshtein)

We also examined the distance between the strings of assigned haplogroups, as measures of haplogroup concordance may be misleading if only one sub-haplogroup level is incorrectly assigned. We used a few different measures, as different measures of distance will provide different results. All results compare the genotyped dataset with the imputed dataset filtered at info > 0.3.
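
To make the idea concrete, the sketch below compares a hypothetical truth haplogroup with hypothetical genotyped and imputed calls using two string distances. The labels are invented for illustration. Base R's `adist` gives the Levenshtein distance; base R has no Damerau-Levenshtein function, so that measure is not shown. The Jaccard helper, which works on sets of single characters, is an assumption about how such a measure might be defined and may differ from what was used for the `_jc_` columns. In the first row of the summary printed earlier, the `_dl_` and `_lv_` values are identical, which is what one would expect when assigned names rarely differ by a swap of adjacent characters.

```r
# Hypothetical haplogroup labels (illustrative only).
truth   <- "H1a1"
typed   <- "H1"
imputed <- "H1a"

# Levenshtein distance: number of single-character insertions,
# deletions, or substitutions needed to turn one string into the other.
lv <- function(a, b) as.vector(adist(a, b))

# Jaccard distance on sets of single characters (assumed definition).
jaccard <- function(a, b) {
  sa <- unique(strsplit(a, "")[[1]])
  sb <- unique(strsplit(b, "")[[1]])
  1 - length(intersect(sa, sb)) / length(union(sa, sb))
}

lv_typed   <- lv(truth, typed)
lv_imputed <- lv(truth, imputed)
jc_typed   <- jaccard(truth, typed)
jc_imputed <- jaccard(truth, imputed)

# A smaller distance means the call is closer to the truth haplogroup.
print(c(lv_typed = lv_typed, lv_imputed = lv_imputed,
        jc_typed = jc_typed, jc_imputed = jc_imputed))
```

In this toy case neither call matches the truth exactly, so both would count as discordant, yet the distances show the imputed call is one sub-haplogroup level closer.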

This result shows the Damerau-Levenshtein distance:
Boxplot of the mean difference in Damerau-Levenshtein string distance of assigned haplogroups between the genotyped data and the imputed data. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

Table showing the analysis of variance for the linear model testing for a significant difference in the Damerau-Levenshtein string distance between assigned haplogroups

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 4.239264 | 2.1196323 | 15.10913 | 6e-07 |
| Residuals | 294 | 41.244733 | 0.1402882 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the Damerau-Levenshtein string distance between assigned haplogroups for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.3900615 | 0.0372692 | 294 | 0.3167133 | 0.4634097 |
| MAF0.5% | 0.1255738 | 0.0369056 | 294 | 0.0529412 | 0.1982063 |
| MAF0.1% | 0.1539571 | 0.0388391 | 294 | 0.0775192 | 0.2303950 |

Table showing the contrasts for the linear model testing for a significant difference in the Damerau-Levenshtein string distance between assigned haplogroups for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | 0.2644877 | 0.0524501 | 294 | 5.0426543 | 0.0000024 |
| MAF1% - MAF0.1% | 0.2361044 | 0.0538281 | 294 | 4.3862644 | 0.0000477 |
| MAF0.5% - MAF0.1% | -0.0283833 | 0.0535770 | 294 | -0.5297671 | 0.8567930 |
